Doc on handling worker with walltime #481
mivade left a comment
Thanks, this looks like great additional documentation! I've pointed out some typos and some suggestions to clarify the language a bit, but otherwise this looks good!
| - when you don't have a lot of room on you HPC platform and have only a few workers at a time (less than what you were hopping for when using scale or adapt). These workers will be killed (and others started) before you workload ends.
| - when you really don't know how long your workload will take: all your workers could be killed before reaching the end. In this case, you'll want to use adaptive clusters so that Dask ensures some workers are always up.
|
| If you don't set the proper parameters, you'll run into KilleWorker exceptions in those two cases.
|
| The solution to this problem is to tell Dask up front that the workers have a finit life time:
Typo: finit -> finite. Similarly, "lifetime" is usually spelled as a single word.
| The solution to this problem is to tell Dask up front that the workers have a finit life time:
|
| - Use `--lifetime` worker option. This will enables infinite workloads using adaptive. Workers will be properly shut down before the scheduling system kills them, and all their states moved.

| How to handle job queueing system walltime killing workers
| ----------------------------------------------------------
|
| In dask-jobqueue, every worker processes run inside a job, and all jobs have a time limit in job queueing systems.
Should be "every worker process runs..."
| In dask-jobqueue, every worker processes run inside a job, and all jobs have a time limit in job queueing systems.
| Reaching walltime can be troublesome in several cases:
|
| - when you don't have a lot of room on you HPC platform and have only a few workers at a time (less than what you were hopping for when using scale or adapt). These workers will be killed (and others started) before you workload ends.
hopping -> hoping and "before you workload" -> "before your workload"
| The solution to this problem is to tell Dask up front that the workers have a finit life time:
|
| - Use `--lifetime` worker option. This will enables infinite workloads using adaptive. Workers will be properly shut down before the scheduling system kills them, and all their states moved.
| - Use `--lifetime-stagger` when dealing with many workers (say > 20): this will allow to avoid workers all terminating at the same time, and so to ease rebalancing tasks and scheduling burden.
"this will allow to avoid workers all" -> "this will prevent workers from"
"and so to ease" -> "and so ease" or (probably better) "thus"
| cluster.adapt(minimum=0, maximum=200)
|
| Here is an example of a workflow taking advantage of this, if you wan't to give it a try or adapt it to your use case:
Many thanks @mivade! I need to practice my English...
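For anyone following the thread, here is a minimal sketch of how the `--lifetime` and `--lifetime-stagger` values discussed in the hunks above might be derived from a job walltime. The helper names (`parse_walltime`, `lifetime_args`), the 5-minute margin and the 4-minute stagger are illustrative assumptions and are not taken from the PR. The sketch also assumes the stagger can lengthen a worker's lifetime by up to its full value, so the margin is kept larger than the stagger.

```python
from datetime import timedelta
from typing import List


def parse_walltime(walltime: str) -> timedelta:
    """Parse an 'HH:MM:SS' walltime string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def lifetime_args(
    walltime: str, margin_minutes: int = 5, stagger_minutes: int = 4
) -> List[str]:
    """Build worker options so that workers retire before the job walltime.

    Hypothetical helper: the lifetime is the walltime minus a safety margin,
    and the margin must exceed the stagger so that a staggered worker still
    shuts down before the queueing system kills the job.
    """
    total_minutes = int(parse_walltime(walltime).total_seconds() // 60)
    lifetime = total_minutes - margin_minutes
    if lifetime <= 0 or stagger_minutes >= margin_minutes:
        raise ValueError("margin must fit in the walltime and exceed the stagger")
    return [
        "--lifetime", f"{lifetime}m",
        "--lifetime-stagger", f"{stagger_minutes}m",
    ]


print(lifetime_args("01:00:00"))

# Possible use with dask-jobqueue (assumes the package is installed and that
# the cluster class accepts the worker options through `extra`; newer
# releases may name this keyword differently):
#
#   from dask_jobqueue import SLURMCluster
#   cluster = SLURMCluster(walltime="01:00:00", cores=4, memory="16GB",
#                          extra=lifetime_args("01:00:00"))
#   cluster.adapt(minimum=0, maximum=200)
```

Keeping the margin larger than the stagger is a design choice of this sketch: it guarantees that even the longest-lived worker asks to shut down before the job hits its walltime.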
andersy005 left a comment
Thank you for putting this together, @guillaumeeb!
Finally, a little contribution from me, and a doc fix to a long standing issue.
Fixes #122.